Question 1

1.a. Perform an exploratory data analysis (EDA) to extract different data characteristics. Perform necessary manipulations (e.g., treatment of outliers). Use 95th percentile for outlier capping, if needed. Comment.
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)

First, in order to make the problem/dataset more interesting, we categorize the alcohol variable (which is initially numerical) into three levels: low, medium, and high. The separation rule is as follows:

  • Low (levels.alcohol == 1) when alcohol is below \(10.5 \%\).

  • Medium (levels.alcohol == 2) when alcohol is below \(12 \%\).

  • High (levels.alcohol == 3) when alcohol is above \(12 \%\).

wineDataset <- read.csv("/Users/philippebeliveau/Downloads/winequality-red.csv")

alcohol_levels <- wineDataset %>% select(alcohol) %>% mutate("Low_alcohol" = alcohol<=10.5, "normal_alcohol" = alcohol>10.5 & alcohol<=12, 
                                                      "high_alcohol"= alcohol>12)

alcohol_levels$Low_alcohol <- as.integer(as.logical(alcohol_levels$Low_alcohol))
alcohol_levels$normal_alcohol <- as.integer(as.logical(alcohol_levels$normal_alcohol))
alcohol_levels$high_alcohol <- as.integer(as.logical(alcohol_levels$high_alcohol))
alcohol_levels <- alcohol_levels[, -1]

wineDataset <- cbind(wineDataset, alcohol_levels)

wine_t <- wineDataset %>% mutate(levels.alcohol=
                            case_when(Low_alcohol==1~1, normal_alcohol==1~2, high_alcohol==1~3))

wineDataset <- wine_t[, c(1:10, 12, 16)]

We’ll start off by exploring the data…

wineDataset <- wineDataset[complete.cases(wineDataset), ]

str(wineDataset)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ levels.alcohol      : num  1 1 1 1 1 1 1 1 1 1 ...
head(wineDataset)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates quality
## 1                  11                   34  0.9978 3.51      0.56       5
## 2                  25                   67  0.9968 3.20      0.68       5
## 3                  15                   54  0.9970 3.26      0.65       5
## 4                  17                   60  0.9980 3.16      0.58       6
## 5                  11                   34  0.9978 3.51      0.56       5
## 6                  13                   40  0.9978 3.51      0.56       5
##   levels.alcohol
## 1              1
## 2              1
## 3              1
## 4              1
## 5              1
## 6              1
colSums(is.na(wineDataset))
##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              quality       levels.alcohol 
##                    0                    0                    0

Comment: There are several observations:

  • The dataset contains 1599 observations.

  • The dataset has 11 explanatory variables and doesn’t have any NA values.

  • All variables are continuous variables, except the level of alcohol, which we will treat as a categorical variable with three distinct categories.

1.a.i Descriptive Statistics

Some descriptive statistics:

summary<-sapply(wineDataset,function(x) c(mean(x),sd(x),min(x),max(x),length(x)))
row.names(summary)<-c("mean","sd","min","max","n")
summary
##      fixed.acidity volatile.acidity  citric.acid residual.sugar    chlorides
## mean      8.319637        0.5278205    0.2709756       2.538806 8.746654e-02
## sd        1.741096        0.1790597    0.1948011       1.409928 4.706530e-02
## min       4.600000        0.1200000    0.0000000       0.900000 1.200000e-02
## max      15.900000        1.5800000    1.0000000      15.500000 6.110000e-01
## n      1599.000000     1599.0000000 1599.0000000    1599.000000 1.599000e+03
##      free.sulfur.dioxide total.sulfur.dioxide      density           pH
## mean            15.87492             46.46779 9.967467e-01    3.3111132
## sd              10.46016             32.89532 1.887334e-03    0.1543865
## min              1.00000              6.00000 9.900700e-01    2.7400000
## max             72.00000            289.00000 1.003690e+00    4.0100000
## n             1599.00000           1599.00000 1.599000e+03 1599.0000000
##         sulphates      quality levels.alcohol
## mean    0.6581488    5.6360225      1.4734209
## sd      0.1695070    0.8075694      0.6526256
## min     0.3300000    3.0000000      1.0000000
## max     2.0000000    8.0000000      3.0000000
## n    1599.0000000 1599.0000000   1599.0000000

Comment: We notice that variables have different scales. Certain variables such as free.sulfur.dioxide and total.sulfur.dioxide have a large standard deviation, whereas others like density have very small standard deviation.

We are looking for features that hold significant amount of information about our response variable. Thus, we are especially looking for features with high standard deviation. free.sulfur.dioxide and total.sulfur.dioxide seems to be good candidates for helping us explain the quality of the wine.

1.a.ii Distribution of the variables:
Histogram of the explanatory variables:
# Histograms
par(mfrow=c(2,2))
hist(wineDataset$fixed.acidity,xlab="fixed acidity",main="Histogram")
hist(wineDataset$volatile.acidity,xlab="volatile acidity",main="Histogram")
hist(wineDataset$citric.acid,xlab="citric acid",main="Histogram")
hist(wineDataset$residual.sugar,xlab="residual sugar",main="Histogram")

par(mfrow=c(2,2))
hist(wineDataset$chlorides,xlab="chlorides",main="Histogram")
hist(wineDataset$free.sulfur.dioxide,xlab="free.sulfur.dioxide",main="Histogram")
hist(wineDataset$total.sulfur.dioxide,xlab="total.sulfur.dioxide",main="Histogram")
hist(wineDataset$density,xlab="density",main="Histogram")

par(mfrow=c(2,2))
hist(wineDataset$pH,xlab="pH",main="Histogram")
hist(wineDataset$sulphates,xlab="sulphates",main="Histogram")
hist(wineDataset$levels.alcohol,xlab="alcohol",main="Histogram")

Histogram of the response variable (quality):
hist(wineDataset$quality,xlab="quality",main="Histogram")

Comment: Some observations include:

  • We see that the majority of our explanatory variables are slightly skewed to the right. We could then consider using a different correlation coefficient that doesn’t require normality between the explanatory variable to assess the correlation between them.

  • We also see that our response variable is relatively normally distributed. It is expected; low and high quality wines are less observed, while average quality wines are the most frequent. There are more than 68.9% of the observations that are within one standard deviation, thus the distribution is not perfectly normally distributed.

  • Another interesting observation includes the imbalance of the levels of alcohol. We see that most wines have a low level of alcohol. It could be interesting to look into more details at the characteristic that compose higher level of alcohol.

1.a.iii Scatterplots

We can also examine the relationship between the response variable and the explanatory variables using scatterplots.

Fixed acidity and volatile acidity vs. quality scatter plots
par(mfrow=c(1,2))
# quality, fixed.acidity
plot(quality~fixed.acidity,xlab="fixed acidity",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~fixed.acidity,data=wineDataset),lwd=1.5)

# quality, volatile.acidity
plot(quality~volatile.acidity,xlab="volatile acidity",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~volatile.acidity,data=wineDataset),lwd=1.5)

citric acid and pH vs. quality scatter plots
par(mfrow=c(1,2))
# quality, citric.acid
plot(quality~citric.acid,xlab="citric acid",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~citric.acid,data=wineDataset),lwd=1.5)

# quality, pH
plot(quality~pH,xlab="pH",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~pH,data=wineDataset),lwd=1.5)

residual sugar and chlorides vs. quality scatter plots
par(mfrow=c(1,2))
# quality, residual.sugar
plot(quality~residual.sugar,xlab="residual sugar",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~residual.sugar,data=wineDataset),lwd=1.5)

# quality, chlorides
plot(quality~chlorides,xlab="chlorides",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~chlorides,data=wineDataset),lwd=1.5)

free.sulfur.dioxide and total.sulfur.dioxide vs. quality scatter plots
par(mfrow=c(1,2))
# quality, free.sulfur.dioxide
plot(quality~free.sulfur.dioxide,xlab="free sulfur dioxide",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~free.sulfur.dioxide,data=wineDataset),lwd=1.5)

# quality, total.sulfur.dioxide
plot(quality~total.sulfur.dioxide,xlab="total sulfur dioxide",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~total.sulfur.dioxide,data=wineDataset),lwd=1.5)

density and alcohol vs. quality scatter plots
par(mfrow=c(1,2))
# quality, density
plot(quality~density,xlab="density",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~density,data=wineDataset),lwd=1.5)

sulphates vs. quality scatter plot
par(mfrow=c(1,2))
# quality, sulphates
plot(quality~sulphates,xlab="sulphates",ylab="quality",data=wineDataset,lwd=1.5)
abline(lm(quality~sulphates,data=wineDataset),lwd=1.5)

Comment: Highlights from this inspection include:

  • There seems to have a strong negative correlation between the volatile acidity and the quality of the wine. Meaning that low quality wine have significantly higher volatile acidity level than good quality wine.

  • We also see that having higher levels of chloride seems to be associated with lower quality of wine.

  • The graph of quality and density seems interesting, as the value of density seems to be able to distinguish between low quality wine (3,5) to decent and very good wine (6,8). This make us believe that the level of density might be a good candidate to distinguish the wine quality.

Conclusion: These graphs only tell a limited story, as they examine the relationship between quality and one explanatory variable at a time.By looking at these graphs, it is not possible to deduct how these variables effect quality of wine together.

1.a.iv. Treatment of outliers

From the EDA, we observe that some variables might have outliers (e.g., total sulfur dioxide). It is therefore critical to treat them before modeling. We will use the common method of capping the lowest and the highest \(5\%\) of the observations for all the numerical variables.

library(dlookr)
## Warning in !is.null(rmarkdown::metadata$output) && rmarkdown::metadata$output
## %in% : 'length(x) = 2 > 1' in coercion to 'logical(1)'
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:base':
## 
##     transform
wineDataset$fixed.acidity <- imputate_outlier(wineDataset, fixed.acidity, method = "capping")
wineDataset$volatile.acidity <- imputate_outlier(wineDataset, volatile.acidity, method = "capping")
wineDataset$citric.acid <- imputate_outlier(wineDataset, citric.acid, method = "capping")
wineDataset$pH <- imputate_outlier(wineDataset, pH, method = "capping")

wineDataset$residual.sugar <- imputate_outlier(wineDataset, residual.sugar, method = "capping")
wineDataset$chlorides <- imputate_outlier(wineDataset, chlorides, method = "capping")
wineDataset$free.sulfur.dioxide <- imputate_outlier(wineDataset, free.sulfur.dioxide, method = "capping")
wineDataset$total.sulfur.dioxide <- imputate_outlier(wineDataset, total.sulfur.dioxide, method = "capping")

wineDataset$density <- imputate_outlier(wineDataset, density, method = "capping")
wineDataset$sulphates <- imputate_outlier(wineDataset, sulphates, method = "capping")

Comment: Note that the default capping value of the imputate_outlier function is 95%, therefore we don’t need to specify the capping value here.

1.b. Inspect the correlation between the variables. Is there any collinearity between the features in the dataset? If yes, provide a possible solution.
library("corrplot")
## corrplot 0.92 loaded
correlations <- cor(wineDataset)

corrplot(correlations, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

corrplot(correlations, type = "upper",  order = "hclust", 
         tl.col = "black", tl.srt = 45, method="number", number.cex=0.60)

Comment: From the graph, we see that the pH level and the fixed acidity are strongly negatively correlated, while the fixed acidity and density are strongly positively correlated, with correlation coefficients of -0.68 and 0.67, respectively. Moreover, free sulfur dioxide and total sulfur dioxide have a correlation coefficient of 0.67, meaning they have a strong positive correlation.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
mod.col1<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)
summary(mod.col1)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + as.factor(levels.alcohol), data = wineDataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77378 -0.37303 -0.04796  0.44652  2.02834 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 6.865e+01  2.040e+01   3.365 0.000783 ***
## fixed.acidity               7.763e-02  2.484e-02   3.125 0.001808 ** 
## volatile.acidity           -1.020e+00  1.294e-01  -7.878 6.12e-15 ***
## citric.acid                -2.722e-01  1.463e-01  -1.861 0.062908 .  
## residual.sugar              3.796e-02  2.066e-02   1.837 0.066332 .  
## chlorides                  -3.347e+00  1.001e+00  -3.344 0.000847 ***
## free.sulfur.dioxide         4.344e-03  2.477e-03   1.754 0.079707 .  
## total.sulfur.dioxide       -3.362e-03  8.591e-04  -3.913 9.49e-05 ***
## density                    -6.349e+01  2.097e+01  -3.027 0.002507 ** 
## pH                         -1.865e-01  1.889e-01  -0.987 0.323612    
## sulphates                   1.479e+00  1.400e-01  10.566  < 2e-16 ***
## as.factor(levels.alcohol)2  3.238e-01  4.818e-02   6.721 2.50e-11 ***
## as.factor(levels.alcohol)3  7.259e-01  8.148e-02   8.909  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6499 on 1586 degrees of freedom
## Multiple R-squared:  0.3573, Adjusted R-squared:  0.3525 
## F-statistic: 73.48 on 12 and 1586 DF,  p-value: < 2.2e-16
vif(mod.col1)
##                               GVIF Df GVIF^(1/(2*Df))
## fixed.acidity             6.102908  1        2.470407
## volatile.acidity          1.813828  1        1.346784
## citric.acid               3.049953  1        1.746411
## residual.sugar            1.500429  1        1.224920
## chlorides                 1.308109  1        1.143726
## free.sulfur.dioxide       2.107918  1        1.451867
## total.sulfur.dioxide      2.440603  1        1.562243
## density                   4.974059  1        2.230260
## pH                        2.799577  1        1.673194
## sulphates                 1.272692  1        1.128137
## as.factor(levels.alcohol) 2.494696  2        1.256766

Comment: From the VIF table, we observe that fixed acidity has the highest VIF of 6.10, followed by density with a VIF of 4.97. We will therefore proceed with a model “without” the fixed acidity variable and see if the VIF values decrease.

mod.col2<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)
summary(mod.col2)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + as.factor(levels.alcohol), data = wineDataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65517 -0.38253 -0.04588  0.44502  2.05982 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.343e+01  1.442e+01   1.625 0.104387    
## volatile.acidity           -1.002e+00  1.297e-01  -7.729 1.91e-14 ***
## citric.acid                -1.210e-01  1.384e-01  -0.874 0.382017    
## residual.sugar              1.955e-02  1.986e-02   0.984 0.325044    
## chlorides                  -3.783e+00  9.941e-01  -3.806 0.000147 ***
## free.sulfur.dioxide         5.117e-03  2.472e-03   2.070 0.038586 *  
## total.sulfur.dioxide       -4.107e-03  8.276e-04  -4.963 7.69e-07 ***
## density                    -1.614e+01  1.454e+01  -1.110 0.267209    
## pH                         -5.738e-01  1.429e-01  -4.016 6.21e-05 ***
## sulphates                   1.437e+00  1.397e-01  10.281  < 2e-16 ***
## as.factor(levels.alcohol)2  3.771e-01  4.519e-02   8.344  < 2e-16 ***
## as.factor(levels.alcohol)3  8.165e-01  7.635e-02  10.693  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6516 on 1587 degrees of freedom
## Multiple R-squared:  0.3534, Adjusted R-squared:  0.3489 
## F-statistic: 78.84 on 11 and 1587 DF,  p-value: < 2.2e-16
vif(mod.col2)
##                               GVIF Df GVIF^(1/(2*Df))
## volatile.acidity          1.810440  1        1.345526
## citric.acid               2.716333  1        1.648130
## residual.sugar            1.378407  1        1.174056
## chlorides                 1.282706  1        1.132566
## free.sulfur.dioxide       2.086896  1        1.444609
## total.sulfur.dioxide      2.252512  1        1.500837
## density                   2.378154  1        1.542126
## pH                        1.593939  1        1.262513
## sulphates                 1.260527  1        1.122732
## as.factor(levels.alcohol) 2.088804  2        1.202194

Comment: We see that when the fixed acidity variable is excluded from the model, all the VIF values are below a reasonable threshold of 3. Based on the VIF values, we remove fixed acidity.

1.c. Inspect the interaction between alcohol level and density on the quality of the wine. Does there seem to be an interaction between the two variables? Comment.

Comment: We see that the slopes for different levels of alcohol are not the same, which could be a sign of interaction between alcohol and density. More specifically, line 1 and line 3 are not parallel, i.e., the effect of the level of alcohol on quality changes as a function of density. We see that the higher the value of density, the smaller the difference in quality of wine between low and high levels of alcohol. Although this interaction looks weak.

1.d. Model the quality of wine in terms of all of the explanatory variables including an interaction between alcohol and density. Provide the model summary results.

The linear regression is of the form:

\[ \begin{align} Y = &\beta_{0} + \beta_{1}volatile.acidity + \beta_{2}citric.acid + \beta_{3}residual.sugar + \beta_{4}chlorides + \beta_{5}free.sulfur.dioxide + \\ &\beta_{6}total.sulfur.dioxide + \beta_{7}density + \beta_{8}pH + \beta_{9}sulphates + \beta_{10}medium.alcohol + \beta_{11}high.alcohol + \\ & \beta_{12}(medium.alcohol*density) + \beta_{13}(high.alcohol*density) + \epsilon \end{align} \]

mod<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol)+as.factor(levels.alcohol)*density,data=wineDataset)
summary(mod)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) * 
##     density, data = wineDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6574 -0.3801 -0.0422  0.4469  2.0606 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.116e+01  1.787e+01   1.184 0.236674    
## volatile.acidity                   -1.001e+00  1.299e-01  -7.703 2.33e-14 ***
## citric.acid                        -1.208e-01  1.387e-01  -0.871 0.384070    
## residual.sugar                      1.958e-02  1.995e-02   0.981 0.326542    
## chlorides                          -3.766e+00  9.986e-01  -3.772 0.000168 ***
## free.sulfur.dioxide                 5.110e-03  2.476e-03   2.064 0.039182 *  
## total.sulfur.dioxide               -4.113e-03  8.305e-04  -4.953 8.10e-07 ***
## density                            -1.385e+01  1.801e+01  -0.769 0.441777    
## pH                                 -5.759e-01  1.434e-01  -4.017 6.17e-05 ***
## sulphates                           1.435e+00  1.400e-01  10.255  < 2e-16 ***
## as.factor(levels.alcohol)2          4.967e+00  2.330e+01   0.213 0.831242    
## as.factor(levels.alcohol)3          5.978e+00  3.485e+01   0.172 0.863837    
## density:as.factor(levels.alcohol)2 -4.604e+00  2.338e+01  -0.197 0.843891    
## density:as.factor(levels.alcohol)3 -5.183e+00  3.502e+01  -0.148 0.882383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.652 on 1585 degrees of freedom
## Multiple R-squared:  0.3534, Adjusted R-squared:  0.3481 
## F-statistic: 66.63 on 13 and 1585 DF,  p-value: < 2.2e-16
1.e. Based on the model in part (d), write an expression for the fitted model for each category of alcohol. Comment on the value of R2.

The fitted model has the form:

\[ \begin{align} \hat{quality} = &(2.116e+01) - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides + (5.110e-03)free.sulfur.dioxide - \\ &(4.113e-03)total.sulfur.dioxide -(1.385e+01)density -(5.759e-01)pH + \\ & (1.435e+00)sulphates + (4.967e+00)medium.alcohol + (5.978e+00)high.alcohol - \\ &(4.604e+00)medium.alcohol*density - (5.183e+00)high.alcohol*density \end{align} \]

We can split up the model based on the level of alcohol:

When the level of alcohol is \(low\):

\[ \begin{align} &\hat{E}(quality|alcohol=low) = (2.116e+01) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide -(1.385e+01)density - \\ & (5.759e-01)pH + (1.435e+00)sulphates \end{align} \]

When the level of alcohol is \(medium\):

\[ \begin{align} &\hat{E}(quality|alcohol=medium) = (2.116e+01 + 4.967e+00) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide - \\ & (1.385e+01+4.604e+00) density -(5.759e-01)pH + (1.435e+00)sulphates \\ \\ & = 2.613e+01 - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides (5.110e-03)free.sulfur.dioxide - \\ &(4.113e-03)total.sulfur.dioxide - 18.454 density -(5.759e-01)pH + (1.435e+00)sulphates \end{align} \]

When the level of alcohol is \(high\):

\[ \begin{align} &\hat{E}(quality|alcohol=high) = (2.116e+01 + 5.978e+00) - (1.001e+00)volatile.acidity - \\ &(1.208e-01)citric.acid + (1.958e-02)residual.sugar - (3.766e+00)chlorides + \\ & (5.110e-03)free.sulfur.dioxide -(4.113e-03)total.sulfur.dioxide - \\ & (1.385e+01+ 5.183e+00)density -(5.759e-01)pH + (1.435e+00)sulphates \\ \\ & = 2.714e+01 - (1.001e+00)volatile.acidity -(1.208e-01)citric.acid + \\ & (1.958e-02)residual.sugar - (3.766e+00)chlorides + (5.110e-03)free.sulfur.dioxide - \\ & (4.113e-03)total.sulfur.dioxide - 19.033density -(5.759e-01)pH + (1.435e+00)sulphates \end{align} \]

\(R^2\) is 0.3534, meaning that all the explanatory variables in the model and the interaction variables can explain 36.15% of the variability in the response variable (quality).

Adjusted \(R^2\) = 0.3481.

1.f. Based on the model in part (d), interpret the regression coefficient associated with residual sugar, and the main effect of density.

Interpreting the regression coefficient associated with residual sugar (\(\beta_3\)):

\[ \begin{align} &\beta_3 = E(quality|residual sugar = x + 1) - \\ &E(quality|residual sugar = x) \end{align} \]

On average, one unit increase in residual sugar parameter causes quality of wine to increase by 1.958e-02, holding all other variables constant.

Interpreting the regression coefficient associated with the main effect of density.(\(\beta_7\)):

\[ \begin{align} &\beta_7 = E(quality|density = x + 1, alcohol=low) - \\ &E(quality|density = x,new,alcohol=low) \end{align} \]

On average , one unit increase in density causes quality of wine with low alcohol to change by -(1.385e+01) holding all other variables constant. In other words, on average, one unit increase in density causes quality of wine to decrease by 1.385e+01, when alcohol = low, holding all other variables constant.

1.g. Based on the model in part (d), does the effect of density on the quality depend on the level of alcohol? (categorized as low/medium/high)? Justify your answer.

Here we are interested in testing the hypotheses:

\(H_0: \beta_{12} = \beta_{13} =0\) vs. \(H_1:\) at least one of them \(\neq 0\)

To test the hypothesis, we proceed to build a new model called mod2, which doesn’t contain the interaction variables \(medium.alcohol*density\) and \(high.alcohol*density\).

mod2 <- lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol),data=wineDataset)

We then compare mod and mod2 using anova function.

anova(mod, mod2)
## Analysis of Variance Table
## 
## Model 1: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) * 
##     density
## Model 2: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates + as.factor(levels.alcohol)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1585 673.89                           
## 2   1587 673.91 -2 -0.020233 0.0238 0.9765

This leads to a test statistic value (F-test) of 0.0238 and a p-value of 0.97. For the reasonable significance level of \(\alpha=0.05\), we fail to reject \(H_0\) and conclude that the interaction between alcohol and density is not significant, that is, the effect of density on the quality of wine does not depend on the level of alcohol, and vice-versa, keeping all other variables fixed.

1.h. Test whether the alcohol variable is globally significant in the model, using an F-test. Justify your answer. Use 𝛼 = 1%.

Here we are interested in testing the hypotheses:

\(H_0: \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} =0\) vs. \(H_1:\) at least one of them \(\neq 0\)

To test this hypothesis, we build a new model mod3, which excludes the variables \(medium.alcohol\), \(high.alcohol\), \(medium.alcohol*density\), and \(high.alcohol*density\).

mod3 <- lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates,data=wineDataset)

We then compare mod and mod3 using anova function.

anova(mod, mod3)
## Analysis of Variance Table
## 
## Model 1: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) * 
##     density
## Model 2: quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1585 673.89                                  
## 2   1589 728.12 -4   -54.228 31.886 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This leads to a test statistic value (F-test) of 31.886 and a p-value of 2.2e-16. For the significance level of \(\alpha=0.01\), we reject \(H_0\) and conclude that the variable alcohol is globally significant.

1.i Based on the model in part (d), is there a significant difference in the effect of density on the quality of wine with a high level of alcohol in comparison to wine with medium level of alcohol? Justify your answer.

In order to compare the high level of alcohol with medium level of alcohol, we need to relevel the \(alcohol\) variable and build the model again with the re-leveled version of \(alcohol\).

wineDataset$levels.alcohol <-relevel(as.factor(wineDataset$levels.alcohol),2)

levels(as.factor(wineDataset$levels.alcohol))
## [1] "2" "1" "3"

Now that \(medium.alcohol\) is the reference level, we can construct the model again.

mod4<-lm(quality~volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+as.factor(levels.alcohol)+as.factor(levels.alcohol)*density,data=wineDataset)
summary(mod4)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + as.factor(levels.alcohol) + as.factor(levels.alcohol) * 
##     density, data = wineDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6574 -0.3801 -0.0422  0.4469  2.0606 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.613e+01  1.976e+01   1.322 0.186300    
## volatile.acidity                   -1.001e+00  1.299e-01  -7.703 2.33e-14 ***
## citric.acid                        -1.208e-01  1.387e-01  -0.871 0.384070    
## residual.sugar                      1.958e-02  1.995e-02   0.981 0.326542    
## chlorides                          -3.766e+00  9.986e-01  -3.772 0.000168 ***
## free.sulfur.dioxide                 5.110e-03  2.476e-03   2.064 0.039182 *  
## total.sulfur.dioxide               -4.113e-03  8.305e-04  -4.953 8.10e-07 ***
## density                            -1.846e+01  1.987e+01  -0.929 0.353128    
## pH                                 -5.759e-01  1.434e-01  -4.017 6.17e-05 ***
## sulphates                           1.435e+00  1.400e-01  10.255  < 2e-16 ***
## as.factor(levels.alcohol)1         -4.967e+00  2.330e+01  -0.213 0.831242    
## as.factor(levels.alcohol)3          1.011e+00  3.570e+01   0.028 0.977406    
## density:as.factor(levels.alcohol)1  4.604e+00  2.338e+01   0.197 0.843891    
## density:as.factor(levels.alcohol)3 -5.784e-01  3.589e+01  -0.016 0.987143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.652 on 1585 degrees of freedom
## Multiple R-squared:  0.3534, Adjusted R-squared:  0.3481 
## F-statistic: 66.63 on 13 and 1585 DF,  p-value: < 2.2e-16

We are interested in testing the hypothesis:

\(H_0: \beta_{13} =0\) vs. \(H_1: \beta_{13} \neq 0\).

This leads to a test statistic value (t-value) of -0.016 and a p-value of 0.987143. For the reasonable significance level of \(\alpha=0.05\), we fail to reject \(H_0\) and conclude that the there is not a significant difference in the effect of density on the quality of wine with a high level of alcohol in comparison to wine with medium level of alcohol.

1.j. Carry out a residual analysis. Comment on the results.

In order to asses if the model is well specified and that the t-tests are valid, we make sure that the assumption of normality is met. To do that, we will looking at the residuals.

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates quality
## 1                  11                   34  0.9978 3.51      0.56       5
## 2                  25                   67  0.9968 3.20      0.68       5
## 3                  15                   54  0.9970 3.26      0.65       5
## 4                  17                   60  0.9980 3.16      0.58       6
## 5                  11                   34  0.9978 3.51      0.56       5
## 6                  13                   40  0.9978 3.51      0.56       5
##   levels.alcohol   fitted      resid
## 1              1 5.086284 -0.1325670
## 2              1 5.137421 -0.2114159
## 3              1 5.191389 -0.2939047
## 4              1 5.593930  0.6240677
## 5              1 5.086284 -0.1325670
## 6              1 5.113659 -0.1746367
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

Comment: Here it seems that the residuals are normally distributed due to the bell shape curve of the graph. However, we see that the residuals do not have exactly a mean of 0, which could lead us to think that the residuals are not fully normal. Let’s investigate this comment more in details with other residuals graph.

Comment: The residuals fall on average on the line, although we can clearly see a deviation of the residuals in the quantile range (0, -2). As most of the observation are not contained in that range, we can assume that the residuals follow the assumption of normality.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comment: When looking at the fitted residuals of volatile acidity on the response variable, we see that the assumptions of normality is being respected. The residuals do not show any form of heteroscedasticity (tunnel shape). The residuals seems to stay around their mean.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comment: Here I can say the same thing as the previous graph, the residuals look pretty normal and constant variance around the mean.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comment: In terms of residual sugar, I assume that the assumption of normality is still being respected. Most of the residuals are in the same range of value, even though we see extreme value of residuals, they count for very few observations.

Comment: The boxplots seem okay and suggest no significant abnormalities, we could argue that the residuals for group = 1 are not completely following the assumptions of normality, but we believe that it is not significant enough to violate the assumptions of normality across the group (because the mean residuals value is still very close to zero). In short, there are no huge differences in the mean and variances across all the levels, even though the residuals of levels of alcohol for group = 1 doesn’t have a mean residuals of 0.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Comment: Finally, The residual graph concerning the fitted values is very well fitted and looks linear.

Conclusion: In our residual analysis, we plotted a histogram of residuals, a qqplot, residuals vs. covariates, and residuals vs. fitted values graphs. Overall, the histogram and qqplot showed that normality assumption of residuals is satisfied. Furthermore, residuals vs. covariates, and residuals vs. fitted values graphs showed that the relationships are indeed linear and that the assumption of constant variance (that is variance of residuals is constant for all values of independent variables) seems reasonable here. Therefore, the model seems to be well specified.

Question 2

# setwd("/Users/siddhantarora/Desktop/HEC/Fall 2022/Stats Modelling/Final Project")
# df_bank = read.csv("/Users/siddhantarora/Desktop/HEC/Fall 2022/Stats Modelling/Final Project/bank.csv",
#                   sep = ";", header = TRUE)

df_bank = read.csv("/Users/philippebeliveau/Downloads/bank.csv",
                   sep = ";", header = TRUE)

Part a)

Perform a comprehensive exploratory data analysis (EDA) to obtain a better understanding of the dataset and perform necessary manipulations, if needed. Comment.

summary(df_bank)
##       age            job              marital           education        
##  Min.   :19.00   Length:4521        Length:4521        Length:4521       
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :41.17                                                           
##  3rd Qu.:49.00                                                           
##  Max.   :87.00                                                           
##    default             balance        housing              loan          
##  Length:4521        Min.   :-3313   Length:4521        Length:4521       
##  Class :character   1st Qu.:   69   Class :character   Class :character  
##  Mode  :character   Median :  444   Mode  :character   Mode  :character  
##                     Mean   : 1423                                        
##                     3rd Qu.: 1480                                        
##                     Max.   :71188                                        
##    contact               day           month              duration   
##  Length:4521        Min.   : 1.00   Length:4521        Min.   :   4  
##  Class :character   1st Qu.: 9.00   Class :character   1st Qu.: 104  
##  Mode  :character   Median :16.00   Mode  :character   Median : 185  
##                     Mean   :15.92                      Mean   : 264  
##                     3rd Qu.:21.00                      3rd Qu.: 329  
##                     Max.   :31.00                      Max.   :3025  
##     campaign          pdays           previous         poutcome        
##  Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000   Length:4521       
##  1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000   Class :character  
##  Median : 2.000   Median : -1.00   Median : 0.0000   Mode  :character  
##  Mean   : 2.794   Mean   : 39.77   Mean   : 0.5426                     
##  3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000                     
##  Max.   :50.000   Max.   :871.00   Max.   :25.0000                     
##       y            
##  Length:4521       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
sum(is.na(df_bank))
## [1] 0
dim(df_bank)
## [1] 4521   17
str(df_bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...
head(df_bank)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
## 6  35  management  single  tertiary      no     747      no   no cellular  23
##   month duration campaign pdays previous poutcome  y
## 1   oct       79        1    -1        0  unknown no
## 2   may      220        1   339        4  failure no
## 3   apr      185        1   330        1  failure no
## 4   jun      199        4    -1        0  unknown no
## 5   may      226        1    -1        0  unknown no
## 6   feb      141        2   176        3  failure no
#Converting character values to factors
col_list <- c("job", "marital","education", "default", "housing","loan", 
              "contact","month", "poutcome","y")
for (col in col_list) {
  df_bank[[col]] <- as.factor(df_bank[[col]])
}

Correlations

num_list <- df_bank[,c(1,6,10,12,13,14,15)]
library(corrplot)
C <- cor(num_list)
corrplot(C, method="circle")

We can see that pdays and previous have a high correlation. This suggests that at the time of model building, it is best to include either one of them.

Distribution of the variables

#Job
library(ggplot2)
AA <- ggplot(df_bank, aes(job))
AA <- AA  + geom_histogram(stat="count") + labs(title = "Job")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
AA

table(df_bank$job)
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##           478           946           168           112           969 
##       retired self-employed      services       student    technician 
##           230           183           417            84           768 
##    unemployed       unknown 
##           128            38
round((prop.table(table(df_bank$job))*100),1)
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          10.6          20.9           3.7           2.5          21.4 
##       retired self-employed      services       student    technician 
##           5.1           4.0           9.2           1.9          17.0 
##    unemployed       unknown 
##           2.8           0.8

Approximately 60% of the jobs fall into just 3 categories in our dataset: management, blue-collar and technician, suggesting that the variable is not that well-balanced across all levels of the job.

#Marital

BB <- ggplot(df_bank, aes(marital))
BB <- BB  + geom_histogram(stat="count") + labs(title = "Marital")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
BB

table(df_bank$marital)
## 
## divorced  married   single 
##      528     2797     1196
round((prop.table(table(df_bank$marital))*100),1)
## 
## divorced  married   single 
##     11.7     61.9     26.5

Majority individuals are married. This variable is relatively well balanced if we group single with divorced.

#Default

CC <- ggplot(df_bank, aes(default))
CC <- CC  + geom_histogram(stat="count") + labs(title = "Default")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
CC

table(df_bank$default)
## 
##   no  yes 
## 4445   76
round((prop.table(table(df_bank$default))*100),1)
## 
##   no  yes 
## 98.3  1.7

There is only 1.7% of the individuals who have defaulted. The variable’s predictive power would be very less, given that the distribution is highly unbalanced.

#Education

DD <- ggplot(df_bank, aes(education))
DD <- DD + geom_histogram(stat="count") + labs(title = "Education")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
DD

table(df_bank$education)
## 
##   primary secondary  tertiary   unknown 
##       678      2306      1350       187
round((prop.table(table(df_bank$education))*100),1)
## 
##   primary secondary  tertiary   unknown 
##      15.0      51.0      29.9       4.1

51% of the individuals have a secondary education. There are some unknown observations but they are very less (4.1%) and probably they should not hamper the modelling results.

#Loan

EE <- ggplot(df_bank, aes(loan))
EE <- EE + geom_histogram(stat="count") + labs(title = "Loan")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
EE

table(df_bank$loan)
## 
##   no  yes 
## 3830  691
round((prop.table(table(df_bank$loan))*100),1)
## 
##   no  yes 
## 84.7 15.3

Approximately 85% of the individuals did not have a personal loan.

#Housing

FF <- ggplot(df_bank, aes(housing))
FF <- FF + geom_histogram(stat="count") + labs(title = "Housing Loan")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
FF

table(df_bank$housing)
## 
##   no  yes 
## 1962 2559
round((prop.table(table(df_bank$housing))*100),1)
## 
##   no  yes 
## 43.4 56.6

The distribution here is very well balanced between the individuals who have a housing loan vs. who have no housing loan.

#Contact

GG <- ggplot(df_bank, aes(contact))
GG <- GG + geom_histogram(stat="count") + labs(title = "Contact")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
GG

table(df_bank$contact)
## 
##  cellular telephone   unknown 
##      2896       301      1324
round((prop.table(table(df_bank$contact))*100),1)
## 
##  cellular telephone   unknown 
##      64.1       6.7      29.3

64.1% of the individuals were contacted via cellular means whereas only approx. 7% were contacted via telephone with 30% unknown data.

#Month

HH <- ggplot(df_bank, aes(month))
HH <- HH + geom_histogram(stat="count") + labs(title = "Month")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
HH

table(df_bank$month)
## 
##  apr  aug  dec  feb  jan  jul  jun  mar  may  nov  oct  sep 
##  293  633   20  222  148  706  531   49 1398  389   80   52
round((prop.table(table(df_bank$month))*100),1)
## 
##  apr  aug  dec  feb  jan  jul  jun  mar  may  nov  oct  sep 
##  6.5 14.0  0.4  4.9  3.3 15.6 11.7  1.1 30.9  8.6  1.8  1.2

Majority of the individuals were contacted mostly during the summer months with a sizable portion in the month of May (30.9%). It might show that contacting people during certain months can have a favorable or unfavorable impact on the bank’s marketing campaign to subscribe to term deposits.

#Poutcome

II <- ggplot(df_bank, aes(poutcome))
II <- II + geom_histogram(stat="count") + labs(title = "Poutcome")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
II

table(df_bank$poutcome)
## 
## failure   other success unknown 
##     490     197     129    3705
round((prop.table(table(df_bank$poutcome))*100),1)
## 
## failure   other success unknown 
##    10.8     4.4     2.9    82.0

The value is unknown for 82% of the individuals. We could treat these unknowns as missing data (NA), however, these unknowns could be the individuals which the bank has not contacted yet. Therefore, first it is important to check the relation of this variable with our target variable (y) to see how many individuals from this unknown category subscribe to the term deposit.

#distribution of the response variable y

JJ <- ggplot(df_bank, aes(y))
JJ <- JJ + geom_histogram(stat="count") + labs(title = "Response Variable: y")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
JJ

table(df_bank$y)
## 
##   no  yes 
## 4000  521
round((prop.table(table(df_bank$y))*100),1)
## 
##   no  yes 
## 88.5 11.5

Only around 11.5% of respondents to the current campaign have subscribed to the bank’s term deposit. This suggests that the dataset is unbalanced.

#Age

KK <- ggplot(df_bank, aes(age))
KK <- KK + geom_histogram(stat="count") + labs(title = "Age")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
KK

The distribution here seems to be fairly balanced where the majority individuals are aged between 25 to 50 years old.

#Balance
library(ggplot2)
LL <- ggplot(df_bank, aes(balance))
LL <- LL + geom_histogram(stat="count") + labs(title = "Balance") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlim(c(0,20000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
LL
## Warning: Removed 386 rows containing non-finite values (`stat_count()`).
## Warning: Removed 1 rows containing missing values (`geom_bar()`).

median(df_bank$balance)
## [1] 444

The distribution here is very skewed to the right as we can see. We have very high values as well which suggests we will have to deal with outliers. Compared to all the values, the median here is only 444 which is close to zero which suggests that most of the individuals contacted have close to zero balance.

#Campaign

MM <- ggplot(df_bank, aes(campaign))
MM <- MM + geom_histogram(stat="count") + labs(title = "Campaign")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
MM

The distribution here is again skewed to the right with majority of the individuals being contacted only once or twice during this campaign.

#Day

NN <- ggplot(df_bank, aes(day))
NN <- NN + geom_histogram(stat="count") + labs(title = "Day")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
NN

mean(df_bank$day)
## [1] 15.91528
median(df_bank$day)
## [1] 16

The distribution here is very uniform with mean of approx. 15.9 almost equal to median of 16.

#Duration

OO <- ggplot(df_bank, aes(duration))
OO <- OO + geom_histogram(stat="count") + labs(title = "Duration")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
OO

median(df_bank$duration)
## [1] 185

We have some big values here with a right skewed distribution. The median here is 185 seconds (close to 3 minutes), which suggests that most individuals decide about subscrbing to a term deposit or not within the first 3 to 4 minutes of the call duration.

#Pdays

PP <- ggplot(df_bank, aes(pdays))
PP <- PP + geom_histogram(stat="count") + labs(title = "Pdays")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
PP

median(df_bank$pdays)
## [1] -1

The distribution is extremely skewed with a median of perhaps -1. This value means those individuals who have never been contacted before this campaign, that is, most of the individuals were contacted for the very first time during this campaign.

#Previous

QQ <- ggplot(df_bank, aes(previous))
QQ <- QQ + geom_histogram(stat="count") + labs(title = "Previous")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
QQ

median(df_bank$previous)
## [1] 0

This corroborates with the previous graph (pdays) which suggests that there was no communication before with the individuals contacted during this campaign (Median = 0).

Analysis of the independent variables in relation to the response variable

# Job and y

library(ggplot2)
A <- ggplot(df_bank, aes(job,fill = y))
A <- A + geom_histogram(stat="count") + labs(title = "job and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
A

We can see that individuals with management job are the ones who have subscribed the maximum to a term deposit followed by technicians.

#Marital and y

B <- ggplot(df_bank, aes(marital,fill = y))
B <- B + geom_histogram(stat="count") + labs(title = "marital and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
B

Married individuals have more tendency to subscribe to a term deposit.

#Education and y

C <- ggplot(df_bank, aes(education,fill = y))
C <- C + geom_histogram(stat="count") + labs(title = "education and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
C

Individuals with secondary education subscribe the most to a term deposit.

#Default and y

D <- ggplot(df_bank, aes(default,fill = y))
D <- D + geom_histogram(stat="count") + labs(title = "default and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
D

Individuals who have no credits in default subscribe to a term deposit whereas for individulas who have credits in default hardly subscribe.

#Housing and y

E <- ggplot(df_bank, aes(housing,fill = y))
E <- E + geom_histogram(stat="count") + labs(title = "housing and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
E

Individuals who do not have a housing loan tend to subsribe more to a term deposit compared to the individuals who have a housing loan.

#Loan and y

F <- ggplot(df_bank, aes(loan,fill = y))
F <- F + geom_histogram(stat="count") + labs(title = "loan and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
F

Individuals who do not have a personal loan subscribe more to a term deposit.

#Contact and y

G <- ggplot(df_bank, aes(contact,fill = y))
G <- G + geom_histogram(stat="count") + labs(title = "contact and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
G

When the contact communication type was Cellular, individuals subscribed more to the bank’s term deposit.

#poutcome and y

H <- ggplot(df_bank, aes(poutcome,fill = y))
H <- H + geom_histogram(stat="count") + labs(title = "poutcome and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
H

When the outcome of the previous marketing campaign was a success, individuals subscribe more to the term deposit. However, as mentioned before, the unknown category might represent the new individuals and they are the ones who subscribe the most to the term deposits. Therefore, it’s best to keep this category in for the time being we reach the modelling stage.

#age and y

I <- ggplot(df_bank, aes(age,fill = y))
I <- I + geom_histogram(stat="count") + labs(title = "age and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
I

Individuals of age 25-45 years old subscribe the most to term deposits. The same age group has also the highest counts of not subscribing to the term deposit. This suggests that individuals in this age range are contacted the most due to their high presence in our dataset.

#balance and y
library(dplyr)
y_yes <- df_bank %>% filter(df_bank$y =="yes")
y_no <- df_bank %>% filter(df_bank$y =="no")

J <- ggplot(y_yes, aes(balance))
J <- J + geom_histogram(stat="count", binwidth=10) + labs(title = "balance and y_yes") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlim(c(0,3000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count", binwidth = 10): Ignoring unknown
## parameters: `binwidth`, `bins`, and `pad`
J
## Warning: Removed 120 rows containing non-finite values (`stat_count()`).
## Warning: Removed 1 rows containing missing values (`geom_bar()`).

K <- ggplot(y_no, aes(balance))
K <- K + geom_histogram(stat="count", binwidth=10) + labs(title = "balance and y_no") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlim(c(0,3000)) + ylim(c(0,25))
## Warning in geom_histogram(stat = "count", binwidth = 10): Ignoring unknown
## parameters: `binwidth`, `bins`, and `pad`
K
## Warning: Removed 865 rows containing non-finite values (`stat_count()`).
## Removed 1 rows containing missing values (`geom_bar()`).

Overall, these two graphs show that individuals who have very low balance don’t usually subscribe to a term deposit and very very few people with a low balance actually subscribe for the deposit.

#duration and y

L <- ggplot(df_bank, aes(duration,fill = y))
L <- L + geom_histogram(stat="count") + labs(title = "duration and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+ 
  xlim(c(0,1000))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
L
## Warning: Removed 107 rows containing non-finite values (`stat_count()`).

We can observe from the graph that the individuals who do not wish to subscribe for the term deposit are able to decide the same within the first few minutes of the call and those who wish to subscribe for the deposit usually take longer.

This variable has an interesting impact as it highly affects the the response variable. When the duration = 0, then y is always = no i.e. that is there is no term deposit subscription. Yet, the duration is not known before making the call. Also, after the end of the call, y is obviously known. Therefore, in order to have a realistic predictive model, this variable should be discarded from the analysis.

#month and y

M <- ggplot(df_bank, aes(month,fill = y))
M <- M + geom_histogram(stat="count") + labs(title = "month and y") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
## Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
## `binwidth`, `bins`, and `pad`
M

In the months of May and August, individuals tend to subscribe more to term deposits.

We can similarly check the relation of other numerical variables such as previous, campaign, pdays with the response variable y.

Treatment of outliers

From the analysis done before, we saw that some of the variables have outliers and it is important to deal with them before we proceed with the actual modelling. A very common method of outliers which we can use here is capping where the outliers below the 5th percentile and above the 95th percentile are capped.

As observed, we can perform this capping on the following numeric variables: balance, campaign, pdays and previous.

library(dlookr)
df_bank$balance <- imputate_outlier(df_bank, balance, method = "capping")
df_bank$campaign <- imputate_outlier(df_bank, campaign, method = "capping")
df_bank$pdays <- imputate_outlier(df_bank, pdays, method = "capping")
df_bank$previous <- imputate_outlier(df_bank, previous, method = "capping")

Treatment of Categorical Variables

As seen before, for some of the categorical variables (Marital, Month and Job), the distribution could be improved to have a better predictive performance if we group some of the similar levels together into new broader levels.

library(rockchalk)
## 
## Attaching package: 'rockchalk'
## The following objects are masked from 'package:dlookr':
## 
##     kurtosis, skewness
## The following object is masked from 'package:dplyr':
## 
##     summarize
df_bank$marital <- combineLevels(df_bank$marital, levs = c("single", "divorced"),
                                 newLabel = "unmarried")

df_bank$month <- combineLevels(df_bank$month, levs = c("aug", "sep","oct"), 
                               newLabel = "Autumn")
df_bank$month <- combineLevels(df_bank$month, levs = c("nov", "dec", "jan", 
                                                       "feb", "mar"), 
                               newLabel = "Winter")
df_bank$month <- combineLevels(df_bank$month, levs = c("apr","may", "jun", "jul"),
                               newLabel = "Summer")

df_bank$job <- combineLevels(df_bank$job, levs = c("unemployed", 
                                                   "retired", "student"), 
                             newLabel = "Not Employed")
df_bank$job <- combineLevels(df_bank$job, levs = c("admin.", "management"), 
                             newLabel = "admin and mgnt")
df_bank$job <- combineLevels(df_bank$job, levs = c("blue-collar", "technician"), 
                             newLabel = "blue-collar")
df_bank$job <- combineLevels(df_bank$job, levs = c("housemaid", "services"), 
                             newLabel = "service class")
df_bank$job <- combineLevels(df_bank$job, levs = c("entrepreneur", 
                                                   "self-employed"), 
                             newLabel = "self-employed")

Brief Conclusion

Based on the exploratory data analysis (EDA) done, we have decided to remove two variables from the modelling stage: pdays (since it is highly correlated to the variable previous) and duration (since it highly affects the target variable i.e. if duration = 0 then y = no). Outliers have been dealt with. For the categorical variables, some of the similar levels have been grouped to improve their distribution. The response variable could have been treated as well by making it more balanced by using some advanced packages such as DMwR (function SMOTE), ROSE etc. as a part of advanced EDA but this has not been done here.

Part b)

Fit a regression model (as appropriate) using ‘y’ as the response variable. Model y in terms of age, job, marital and default. Provide the model summary results and write an equation for the fitted model on log-odds scale.

df_bank$y <- ifelse(df_bank$y == "yes", 1, 0)
mod1 <- glm(y ~ age + job + marital + default, data=df_bank,
          family=binomial(link="logit"))
summary(mod1)
## 
## Call:
## glm(formula = y ~ age + job + marital + default, family = binomial(link = "logit"), 
##     data = df_bank)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8884  -0.5242  -0.4589  -0.3969   2.3596  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.212279   0.476216  -4.646 3.39e-06 ***
## age                0.012848   0.004511   2.848  0.00439 ** 
## jobNot Employed   -0.047775   0.437028  -0.109  0.91295    
## jobadmin and mgnt -0.422589   0.428048  -0.987  0.32352    
## jobblue-collar    -0.816166   0.429320  -1.901  0.05729 .  
## jobservice class  -0.730414   0.445387  -1.640  0.10102    
## jobself-employed  -0.666107   0.456582  -1.459  0.14459    
## maritalunmarried   0.467715   0.099489   4.701 2.59e-06 ***
## defaultyes         0.008619   0.361493   0.024  0.98098    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 3163.6  on 4512  degrees of freedom
## AIC: 3181.6
## 
## Number of Fisher Scoring iterations: 5
exp(mod1$coefficients)
##       (Intercept)               age   jobNot Employed jobadmin and mgnt 
##         0.1094509         1.0129306         0.9533485         0.6553477 
##    jobblue-collar  jobservice class  jobself-employed  maritalunmarried 
##         0.4421236         0.4817098         0.5137046         1.5963421 
##        defaultyes 
##         1.0086559

Fitted model equation on log-odds scale:

\(\hat{log(odds(y=1|age,job,marital,default))}\) = -2.212279 + 0.012848age - 0.047775jobNot Employed -0.422589jobadmin and mgnt -0.816166jobblue-collar -0.730414jobservice class -0.666107jobself-employed + 0.467715maritalunmarried + 0.008619defaultyes

Part c)

Based on the model in part b), interpret the regression coefficients for the variables age and default on an appropriate scale.

Interpretation of regression coefficient associated with the variable age

exp(\(\hat{\beta_1}\)) = 1.0129306. It represents the multiplicative effect of age on the odds ratio of subscribing to a term deposit, holding all other variables constant. It is estimated as exp(0.012848) = 1.0129306, i.e. for every one year increase in age the odds of subscribing to a term deposit increases by 1.3%, keeping all other variables unchanged.

Interpretation of regression coefficient associated with the variable default

exp(\(\hat{\beta_8}\)) = 1.0086559. It represents the odds ratio of subscribing to a term deposit for an individual who has defaulted in comparison to an individual who has not defaulted, keeping all other variables fixed. It is estimated as exp(0.008619) = 1.0086559 that is the odds of subscribing to a term deposit are 1.0086559 times higher for individuals who have defaulted compared to individuals who haven’t defaulted, holding all other variables unchanged.

Part d)

Based on the model in part b), what is the estimated odds that an unmarried self-employed individual who is 30 years old and has defaulted will subscribe for the term deposit? What is the estimated probability that a married blue-collared individual who is 50 years old and has not defaulted will subscribe for the term deposit?

Estimated odds

\(\hat{odds(y=1|Marital=unmarried, job=self-employed, age=30, default=yes)}\) = exp(\(\hat{\beta_0}\) + \(\hat{\beta_1}\)*30 + \(\hat{\beta_6}\) + \(\hat{\beta_8}\))

= exp(-2.212279 + 0.012848*30 - 0.666107 + 0.008619)

= exp(-2.484327)

= 0.08338165

Therefore, the estimated odds that an unmarried self-employed individual who is 30 years old and has defaulted will subscribe for the term deposit is 0.08338165.

Estimated Probability

\(\hat{P(y=1|Marital=married, job=blue-collar, age=50, default=no)}\) = \(\frac{exp(\hat\beta_0+\hat\beta_1*50+\hat\beta_4)}{1+exp(\hat\beta_0+\hat\beta_1*50+\hat\beta_4)}\)

= \(\frac{exp(-2.212279+0.012848*50-0.816166)}{1+exp(-2.212279+0.012848*50-0.816166)}\)

= \(\frac{exp(-2.386045)}{1+exp(-2.386045)}\)

= = \(\frac{0.0919928}{1+0.0919928}\)

= 0.08424303

Therefore, the estimated probability that a married blue-collared individual who is 50 years old and has not defaulted will subscribe for the term deposit is 0.08424303.

Part e)

Fit another model which includes the variables age, default, balance, marital, job, campaign and an interaction between balance and marital. Provide the model summary results as obtained in R and write an equation for the fitted model on odds scale.

mod2 <- glm(y ~ age+ default + balance + marital + job + campaign + balance*marital, data=df_bank,
          family=binomial(link="logit"))
summary(mod2)
## 
## Call:
## glm(formula = y ~ age + default + balance + marital + job + campaign + 
##     balance * marital, family = binomial(link = "logit"), data = df_bank)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9731  -0.5306  -0.4500  -0.3815   2.5995  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.025e+00  4.837e-01  -4.187 2.83e-05 ***
## age                       1.188e-02  4.539e-03   2.617  0.00886 ** 
## defaultyes                1.206e-01  3.646e-01   0.331  0.74072    
## balance                   8.794e-05  2.978e-05   2.953  0.00314 ** 
## maritalunmarried          4.951e-01  1.208e-01   4.100 4.14e-05 ***
## jobNot Employed          -7.728e-02  4.396e-01  -0.176  0.86047    
## jobadmin and mgnt        -4.187e-01  4.306e-01  -0.972  0.33090    
## jobblue-collar           -7.980e-01  4.319e-01  -1.848  0.06463 .  
## jobservice class         -7.206e-01  4.480e-01  -1.609  0.10771    
## jobself-employed         -6.576e-01  4.591e-01  -1.432  0.15206    
## campaign                 -1.149e-01  2.775e-02  -4.141 3.46e-05 ***
## balance:maritalunmarried -2.422e-05  4.562e-05  -0.531  0.59551    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 3133.1  on 4509  degrees of freedom
## AIC: 3157.1
## 
## Number of Fisher Scoring iterations: 5
exp(mod2$coefficients)
##              (Intercept)                      age               defaultyes 
##                0.1319797                1.0119515                1.1282260 
##                  balance         maritalunmarried          jobNot Employed 
##                1.0000879                1.6407274                0.9256338 
##        jobadmin and mgnt           jobblue-collar         jobservice class 
##                0.6579124                0.4502142                0.4864597 
##         jobself-employed                 campaign balance:maritalunmarried 
##                0.5180970                0.8914525                0.9999758

Fitted model equation on odds scale:

\(\hat{odds(y=1|balance,marital,job,campaign)}\) = exp(-2.025e+00 + 1.188e-02age + 1.206e-01defaultyes + 8.794e-05balance + 4.951e-01maritalunmarried - 7.728e-02jobNot Employed - 4.187e-01jobadmin and mgnt - 7.980e-01jobblue-collar - 7.206e-01jobservice class - 6.576e-01jobself-employed - 1.149e-01campaign - 2.422e-05balance*maritalunmarried)

Part f)

Based on the model in part e), interpret the regression coefficients with the main effects of variables balance and marital on an appropriate scale.

Interpretation of regression coefficient associated with the main effect of variable balance

exp(\(\hat{\beta_3}\)) = 1.0000879. For every one dollar increase in the balance for a married individual, the odds of subscribing to a term deposit is multiplied by a factor of exp(8.794e-05) = 1.0000879, when all other variables remain unchanged.

Interpretation of regression coefficient associated with the main effect of variable marital

exp(\(\hat{\beta_4}\)) = 1.6407274. It represents the odds ratio of subscribing to a term deposit which is multiplied by a factor of exp(4.951e-01) = 1.6407274 for an unmarried individual who has a balance of 0 in his/her bank account, keeping all other variables fixed.

Part g)

What is the estimated odds ratio for a 25 year old married service class individual who has a balance of 1000, has been contacted 3 times during this campaign and has defaulted vs. a 40 year old unmarried blue-collar individual who has a balance of 5000 dollars, has been contacted once during this campaign and has not defaulted?

Estimated odds ratio is calculated as follows:

\(\frac{odds(y=1|age=25, default=yes, balance=1000, marital=married, job=service class, campagin=3)}{odds(y=1|age=40, default=no, balance=5000, marital=unmarried, job=blue-collar, campagin=1)}\)

= \(\frac{exp(\hat\beta_0+\hat\beta_1*25+\hat\beta_2+\hat\beta_3*1000+\hat\beta_8+\hat\beta_{10}*3)}{exp(\hat\beta_0+\hat\beta_1*40+\hat\beta_3*5000+\hat\beta_4+\hat\beta_7+\hat\beta_{10}*1+\hat\beta_{11}*5000*1)}\)

= \(\frac{exp(-2.025e+00+1.188e-02*25+1.206e-01+8.794e-05*1000-7.206e-01-1.149e-01*3)}{exp(-2.025e+00+1.188e-02*40+8.794e-05*5000+4.951e-01-7.980e-01-1.149e-01*1-2.422e-05*5000*1)}\)

= \(\frac{0.07541418}{0.1922421}\)

= 0.3922875

Therefore, the estimated odds ratio for a 25 year old married service class individual who has a balance of 1000, has been contacted 3 times during this campaign and has defaulted vs. a 40 year old unmarried blue-collar individual who has a balance of 5000 dollars, has been contacted once during this campaign and has not defaulted is 0.3922875.

Part h)

Formally compare the models in parts b) and e) using analysis of deviance. What is your conclusion? Which model will you select on the basis of AIC and BIC?

anova(mod1,mod2,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: y ~ age + job + marital + default
## Model 2: y ~ age + default + balance + marital + job + campaign + balance * 
##     marital
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      4512     3163.6                          
## 2      4509     3133.1  3   30.508 1.079e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod1$aic
## [1] 3181.603
mod2$aic
## [1] 3157.095
BIC(mod1)
## [1] 3239.352
BIC(mod2)
## [1] 3234.093

From the output above, our simplified model is Model 1 which includes 4 variables: age, job, marital and default. We want to test if Model 1 is an adequate simplification of the complete model (Model 2) which apart from these variables also includes balance, campaign and an interaction between balance and marital.

Hypothesis is defined as follows:

\({H_0}\) : \({\beta_3}\) = \({\beta_{10}}\) = \({\beta_{11}}\) = 0

\({H_1}\) : at least one of \({\beta_3}\), \({\beta_{10}}\), \({\beta_{11}}\) \(\neq\) 0

Deviance = 30.508

p-value = 1.079e-06

Since the p-value is much lower than any reasonable \(\alpha\), we can reject the null hypothesis and conclude that Model 1 is NOT an adequate simplification of the complete model (Model 2).

Since both AIC (3157.095 < 3181.603) and BIC (3234.093 < 3239.352) values are lower for Model 2, therefore Model 2 should be selected.